
    Internet delivery of time-synchronised multimedia: the SCOTS projects

    The Scottish Corpus of Texts and Speech (SCOTS) Project at Glasgow University aims to make available over the Internet a 4 million-word multimedia corpus of texts in the languages of Scotland. Twenty percent of this final total will comprise spoken language, in a combination of audio and video material. Versions of SCOTS have been accessible on the Internet since November 2004, and regular additions are made to the Corpus as texts are processed and functionality is improved. While the Corpus is a valuable resource for research, our target users also include the general public, and this has important implications for the nature of the Corpus and website. This paper will begin with a general introduction to the SCOTS Project, and in particular to the nature of our data. The main part of the paper will then present the approach taken to spoken texts. Transcriptions are made using Praat (Boersma and Weenink, University of Amsterdam), which produces a time-based transcription and allows for multiple speakers through independent tiers. This output is then processed to produce a turn-based transcription with overlap and non-linguistic noises indicated. As this transcription is synchronised with the source audio/video material, it allows users direct access to any particular passage of the recording, possibly based upon a word query. This process and the end result will be demonstrated and discussed. We shall end by considering the value which is added to an Internet-delivered Corpus by these means of treating spoken text. The advantages include the possibility of returning search results from both written texts and multimedia documents; the easy location of the relevant section of the audio file; and the production through Praat of a turn-based orthographic transcription, which is accessible to a general as well as an academic user. These techniques can also be extended to other research requirements, such as the mark-up of gesture in video texts.
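    The step from a time-based, tiered transcription to a turn-based one can be illustrated with a minimal sketch. It is not the SCOTS Project's own processing code, which the abstract does not describe in detail. The sketch assumes each speaker tier has already been read into a list of (start, end, text) intervals; the names Interval, Turn and tiers_to_turns are hypothetical, and overlap is flagged simply when a turn starts before the previous turn has ended.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed stand-in for one labelled interval on a speaker tier:
# (start_seconds, end_seconds, text). Empty text means silence.
Interval = Tuple[float, float, str]


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str
    overlaps_previous: bool = False


def tiers_to_turns(tiers: Dict[str, List[Interval]]) -> List[Turn]:
    """Merge per-speaker timed intervals into a single turn sequence."""
    # Flatten all tiers into one list, dropping silent intervals.
    flat = [
        (start, end, speaker, text.strip())
        for speaker, intervals in tiers.items()
        for start, end, text in intervals
        if text.strip()
    ]
    # Order by start time so the turns follow the recording.
    flat.sort(key=lambda item: (item[0], item[1]))

    turns: List[Turn] = []
    for start, end, speaker, text in flat:
        if turns and turns[-1].speaker == speaker:
            # Same speaker carries on: extend the current turn.
            turns[-1].end = max(turns[-1].end, end)
            turns[-1].text += " " + text
        else:
            # New speaker: mark overlap if they start before the
            # previous turn has finished.
            overlaps = bool(turns) and start < turns[-1].end
            turns.append(Turn(speaker, start, end, text, overlaps))
    return turns


if __name__ == "__main__":
    example = {
        "F1": [(0.0, 2.0, "Hello there"), (2.0, 3.0, ""), (3.0, 4.5, "fine thanks")],
        "M2": [(1.5, 3.2, "hi how are you")],
    }
    for turn in tiers_to_turns(example):
        marker = " [overlap]" if turn.overlaps_previous else ""
        print(f"{turn.start:6.2f} {turn.speaker}: {turn.text}{marker}")
```

    Because each turn keeps its start time, a word-query hit in a turn can be mapped back to an offset in the audio or video file, which is the kind of direct access the abstract describes.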

    A generic application for corpus management and administration

    Our corpus project is building a digital collection of both written and spoken texts. The corpus is a publicly available resource, hosted on and searchable via the Web. This paper will describe the corpus management and workflow administration methods that the project has developed and the technologies used. We believe that the structures we have created to manage the different parts of the project's administration form the basis of a reusable, generic package for scholars building an online corpus from new linguistic materials.

    The Scottish corpus of texts and speech


    Corpus of Modern Scottish Writing (CMSW)

    This poster describes the online Corpus of Modern Scottish Writing (1700-1945), being created at the University of Glasgow. The corpus fills the chronological gap between the Helsinki Corpus of Older Scots (1375-1700) and the Scottish Corpus of Texts and Speech (1945-present). The period covered by CMSW is an important time in the history of Scotland and Scots. It begins with the last stages of the standardisation of written English and the onset of the ‘Vernacular Revival’ in literary Scots. Out of the interaction between Broad Scots and written Standard English, the hybrid prestige variety of today’s Scottish English is said to emerge: CMSW will allow researchers to substantiate this claim, among many others. Once complete, CMSW will contain at least 4 million words of text, with accompanying metadata, covering a range of genres, including personal writing, administrative prose, verse and drama, and the writings of language commentators.

    ComPair: compare and visualise the usage of language

    This paper will demonstrate ComPair, a new tool to investigate and compare word usage, encouraging new ways to explore language variation. While remaining focussed on usability and ease of navigation, this tool represents an evolutionary step forward from the author’s previous award-winning visualisation applications. This paper will introduce the methods and technologies at its core, perform a demonstration of the tool and discuss opportunities for further collaboration.

    Computational challenges, innovations and future of Scottish corpora

    This chapter discusses the computational challenges and innovations encountered in the development of the Scottish corpora (the Scottish Corpus of Texts & Speech and the Corpus of Modern Scottish Writing), considers how tools for corpus analysis can encourage new audiences and complement existing resources, and explores possible future technological advances for corpus creation and exploitation.

    Glimpses through the clouds: collocates in a new light

    This paper demonstrates a web-based, interactive data visualisation, allowing users to quickly inspect and browse the collocational relationships present in a corpus. The software is inspired by tag clouds, first popularised by the online photograph-sharing website Flickr (www.flickr.com). A paper based on a prototype of this Collocate Cloud visualisation was given at Digital Resources for the Humanities and Arts 2007. The software has since matured, offering new ways of navigating and inspecting the source data. It has also been expanded to analyse additional corpora, such as the British National Corpus (http://www.natcorp.ox.ac.uk/), which will be the focus of this talk.
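    As a rough illustration of the idea of a cloud of collocates, the sketch below counts the words that occur within a fixed window around a node word and scales each one to a font size, as a tag cloud would. It is not the Collocate Cloud implementation: the window size, the use of raw co-occurrence frequency as the only measure, and the function names collocate_counts and cloud_sizes are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List


def collocate_counts(tokens: List[str], node: str, span: int = 4) -> Counter:
    """Count words occurring within `span` tokens of each hit of `node`."""
    counts: Counter = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)
    return counts


def cloud_sizes(counts: Counter, min_pt: int = 10, max_pt: int = 32) -> Dict[str, int]:
    """Scale each collocate's frequency linearly to a font size in points."""
    if not counts:
        return {}
    top = max(counts.values())
    return {
        word: min_pt + round((max_pt - min_pt) * freq / top)
        for word, freq in counts.items()
    }


if __name__ == "__main__":
    text = "the wee dog saw the wee cat".split()
    counts = collocate_counts(text, "wee", span=1)
    for word, size in sorted(cloud_sizes(counts).items()):
        print(f"{word}: {size}pt")
```

    Raw frequency tends to favour function words such as "the"; a real tool would more plausibly weight collocates with an association measure, but the abstract does not say which, so none is assumed here.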

    SCOTS: Scottish Corpus of Texts and Speech

    This chapter examines the approaches to the collection, handling and analysis of data in the Scottish Corpus of Texts and Speech.

    SCOTS Project (Scottish Corpus of Texts and Speech)


    DiaView: visualise cultural change in diachronic corpora

    This paper will introduce and demonstrate DiaView, a new tool to investigate and visualise word usage in diachronic corpora. DiaView highlights cultural change over time by exposing salient lexical items from each decade or year, and presenting them to the user in an accessible visualisation. This is made possible by examining large quantities of diachronic textual data, in this case the Google Books corpus (Michel et al., 2010) of one million English books. This paper will introduce the methods and technologies at its core, perform a demonstration of the tool and discuss further possibilities.